129 research outputs found

    A Bayesian Approach to Learning Hidden Markov Model Topology with Applications to Biological Sequence Analysis

    Hidden Markov models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition. One fundamental problem in the application of HMMs is finding the underlying architecture or topology, particularly when there is no strong evidence from the application domain, e.g., when doing black-box modeling. Topology matters both for good parameter estimates and for performance: a model with "too many" states, and hence too many parameters, requires too much training data, while a model with "not enough" states prevents the HMM from capturing subtle statistical patterns. We have developed a novel algorithm that, given sequence data originating from an ergodic process, infers an HMM, its topology and its parameters. We introduce a Bayesian approach
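    The likelihood of the training sequences under a candidate topology is the basic quantity any Bayesian comparison of topologies has to evaluate. Below is a minimal sketch of the scaled forward algorithm for that purpose; the two-state topology, the matrices and the function name are illustrative assumptions, not the authors' model or implementation.

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM
    (scaled forward algorithm); the topology is fixed by the shapes of
    trans and emit."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # propagate, then emit
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

# Illustrative two-state topology over a binary alphabet.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(forward_loglik([0, 1, 1, 0, 1], start, trans, emit))
```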

    An Algorithm to Select Target Specific Probes for DNA Chips

    Motivation: The selection of target-specific probes is a relevant problem in the design of DNA chips. Given a set S of genomic sequences, the task is to find at least one oligonucleotide, called probe, for each target sequence in S. This probe will be attached to the chip surface and must be chosen such that it hybridizes to no sequence other than its intended target. Furthermore, all probes on the chip must hybridize to their intended targets under the same reaction conditions, most importantly at the temperature T at which the experiment is conducted. Results: We present an efficient algorithm for the probe design problem. Melting temperatures are calculated for all possible probe-target interactions using an extended nearest-neighbor model, allowing for both non-Watson-Crick base-pairing and unpaired bases within a duplex. To compute temperatures efficiently, a combination of suffix trees and dynamic-programming-based alignment algorithms is introduced. Additional filtering steps during preprocessing increase the speed of the computation. Also, an algorithm to select the actual probes from the set of candidates is presented. The practicability of the algorithms is demonstrated by two case studies: the computation of probes for the identification of different HIV-1 subtypes, and finding probes for 28S rDNA sequences from over 400 organisms
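    The melting temperatures referred to here come from the standard two-state nearest-neighbor model. The sketch below shows only the final temperature formula, assuming the stacked enthalpy and entropy sums have already been accumulated from a nearest-neighbor parameter table; the numerical values in the example are placeholders, and the mismatch and dangling-end extensions described in the abstract are not included.

```python
import math

R = 1.987  # gas constant, cal/(mol*K)

def melting_temperature(dH_kcal, dS_cal, c_total_molar, self_complementary=False):
    """Two-state nearest-neighbor melting temperature (Celsius).

    dH_kcal / dS_cal are the summed nearest-neighbor enthalpy (kcal/mol)
    and entropy (cal/(mol*K)) of the duplex; c_total_molar is the total
    strand concentration.  Salt and mismatch corrections are omitted.
    """
    x = 1.0 if self_complementary else 4.0
    tm_kelvin = (dH_kcal * 1000.0) / (dS_cal + R * math.log(c_total_molar / x))
    return tm_kelvin - 273.15

# Placeholder values; real dH/dS come from a nearest-neighbor parameter
# table applied to every dinucleotide stack of the probe-target duplex.
print(melting_temperature(dH_kcal=-60.0, dS_cal=-170.0, c_total_molar=1e-6))
```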

    SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding

    Scaffolding is an important subproblem in de novo genome assembly in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets. Comment: 16 pages, 6 figures, 7 tables
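    For context, the majority-voting baseline mentioned above lets every mate pair linking two contigs cast one vote for a relative orientation and takes the most frequent call per contig pair. The sketch below illustrates only that baseline, not the SLIQ inequalities themselves; the input format and function name are assumptions.

```python
from collections import Counter, defaultdict

def majority_vote_orientations(mate_pair_links):
    """Baseline relative-orientation call between contig pairs.

    mate_pair_links: iterable of (contig_a, contig_b, orientation) tuples,
    where orientation is e.g. 'same' or 'opposite' as implied by one mate
    pair.  Each link is one vote; the most frequent call wins per pair.
    """
    votes = defaultdict(Counter)
    for a, b, orientation in mate_pair_links:
        votes[tuple(sorted((a, b)))][orientation] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}

links = [("c1", "c2", "same"), ("c1", "c2", "same"), ("c1", "c2", "opposite")]
print(majority_vote_orientations(links))   # {('c1', 'c2'): 'same'}
```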

    CLEVER: Clique-Enumerating Variant Finder

    Next-generation sequencing techniques have facilitated large-scale analysis of human genetic variation. Despite the advances in sequencing speeds, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals. Here we present a novel internal segment size based approach, which organizes all reads, including concordant ones, into a read alignment graph in which max-cliques represent maximal contradiction-free groups of alignments. A specifically engineered algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions (indels). For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present various relevant performance statistics. We achieve superior performance rates in particular on indels of sizes 20–100, which have been identified as a major current challenge in the SV discovery literature and where prior insert size based approaches have limitations. In that size range, we outperform even split-read aligners. We also achieve good results on real data, where, as the only tool, we make a substantial number of correct predictions that complement the predictions of split-read aligners. CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com. Comment: 30 pages, 8 figures
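    The central data structure is the read alignment graph whose maximal cliques group mutually consistent alignments. As a generic point of reference only, the sketch below enumerates maximal cliques with the plain Bron-Kerbosch recursion on a toy adjacency dictionary; CLEVER's specifically engineered enumeration exploits the structure of alignment intervals and is not reproduced here.

```python
def bron_kerbosch(clique, candidates, excluded, adj, out):
    """Enumerate all maximal cliques of a graph given as an adjacency dict."""
    if not candidates and not excluded:
        out.append(set(clique))
        return
    for v in list(candidates):
        bron_kerbosch(clique | {v}, candidates & adj[v], excluded & adj[v], adj, out)
        candidates.remove(v)
        excluded.add(v)

# Toy alignment graph: an edge means two alignments are mutually
# contradiction-free (could support the same putative indel).
adj = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "a4"},
    "a4": {"a3"},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(cliques)  # two maximal cliques: {a1, a2, a3} and {a3, a4}
```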

    Partially-supervised protein subclass discovery with simultaneous annotation of functional residues

    Background: The study of functional subfamilies of protein domain families and the identification of the residues which determine substrate specificity is an important question in the analysis of protein domains. One way to address this question is the use of clustering methods for protein sequence data and approaches to predict functional residues based on such clusterings. The locations of putative functional residues in known protein structures provide insights into how different substrate specificities are reflected on the protein structure level. Results: We have developed an extension of the context-specific independence mixture model clustering framework which allows for the integration of experimental data. As these are usually known only for a few proteins, our algorithm implements a partially-supervised learning approach. We discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal dependent decarboxylases, WW and SH3 domains, to demonstrate the usefulness of our approach. Conclusion: The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains and the predicted functional residues provide insights into the basis of the different substrate specificities.
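    The partially-supervised element amounts to clamping the cluster assignment of the few annotated sequences during the E-step while all other sequences are assigned freely. The sketch below shows that idea for a plain one-dimensional Gaussian mixture, which is an assumption for illustration; the paper's context-specific independence mixture over sequence data is considerably richer.

```python
import numpy as np

def partially_supervised_em(x, labels, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture in which some points carry known labels.

    x: (n,) data; labels: (n,) ints giving the known component index, or -1
    for unlabeled points.  Labeled points get their responsibilities clamped
    in every E-step, which is the partially-supervised element.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k)                 # crude initialization
    sigma = np.full(k, x.std() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibilities (shared constants cancel).
        dens = weights * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        for j in range(k):
            resp[labels == j] = np.eye(k)[j]   # clamp annotated points
        # M-step: weighted parameter updates.
        nk = resp.sum(axis=0)
        weights = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return weights, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
labels = np.full(100, -1)
labels[0], labels[50] = 0, 1                   # two annotated examples
print(partially_supervised_em(x, labels, k=2))
```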

    Gene expression trees in lymphoid development

    Background: The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the lymphoid system due to its importance for basic biology and for clinical applications. Gene expression measured in lymphoid cells in several distinguishable developmental stages helps in the elucidation of underlying molecular processes, which change gradually over time and lock cells in either the B cell, T cell or Natural Killer cell lineages. Large-scale analysis of these gene expression trees requires computational support for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes. Results: We present the first statistical framework designed to analyze gene expression data as it is collected in the course of lymphoid development through clusters of co-expressed genes and additional heterogeneous data. We introduce dependence trees for continuous variates, which model the inherent dependencies during the differentiation process naturally as gene expression trees. Several trees are combined in a mixture model to allow inference of potentially overlapping clusters of co-expressed genes. Additionally, we predict microRNA targets. Conclusion: Computational results for several data sets from the lymphoid system demonstrate the relevance of our framework. We recover well-known biological facts and identify promising novel regulatory elements of genes and their functional assignments. The implementation of our method (licensed under the GPL) is available at http://algorithmics.molgen.mpg.de/Supplements/ExpLym/.
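    A dependence tree over continuous variates can be read as a maximum-weight spanning tree over pairwise dependence scores, in the spirit of Chow-Liu. The sketch below builds such a tree from Gaussian mutual information estimates with Prim's algorithm; the scoring, data layout and function name are assumptions for illustration and do not reproduce the paper's mixture-of-trees estimator.

```python
import numpy as np

def dependence_tree(data):
    """Chow-Liu-style dependence tree over continuous (Gaussian) variates.

    data: (n_samples, n_variates).  Edge weights are Gaussian mutual
    information estimates -0.5*log(1 - rho^2); a maximum spanning tree
    over these weights (Prim's algorithm) gives the tree of dependencies.
    """
    rho = np.corrcoef(data, rowvar=False)
    mi = -0.5 * np.log(np.clip(1.0 - rho ** 2, 1e-12, None))
    np.fill_diagonal(mi, 0.0)
    d = mi.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Add the heaviest edge leaving the current tree.
        i, j, _ = max(((i, j, mi[i, j]) for i in in_tree
                       for j in range(d) if j not in in_tree),
                      key=lambda e: e[2])
        edges.append((i, j))
        in_tree.add(j)
    return edges

data = np.random.default_rng(0).normal(size=(200, 5))
data[:, 1] = data[:, 0] + 0.1 * np.random.default_rng(1).normal(size=200)
print(dependence_tree(data))   # strongly coupled variates 0 and 1 share an edge
```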

    Bayesian localization of CNV candidates in WGS data within minutes

    Background: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward-Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice. Results: In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler. Conclusions: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that these metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop
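    Querying additive sufficient statistics over arbitrary blocks in logarithmic time is the kind of operation a generic Fenwick (binary indexed) tree also supports, and the sketch below is offered only as that generic point of comparison; the paper's structures are in-place, wavelet-aware and more space-efficient, and none of the names below come from the paper.

```python
class FenwickTree:
    """Binary indexed tree: point updates and prefix-sum queries in O(log n),
    so additive statistics over any block [lo, hi] cost two queries."""

    def __init__(self, n):
        self.tree = [0.0] * (n + 1)

    def add(self, i, value):          # 0-based index
        i += 1
        while i < len(self.tree):
            self.tree[i] += value
            i += i & -i

    def prefix_sum(self, i):          # sum of elements 0..i inclusive
        i += 1
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def range_sum(self, lo, hi):      # sum over the block [lo, hi]
        return self.prefix_sum(hi) - (self.prefix_sum(lo - 1) if lo else 0.0)

ft = FenwickTree(8)
for i, v in enumerate([1, 2, 3, 4, 5, 6, 7, 8]):
    ft.add(i, v)
print(ft.range_sum(2, 5))  # 3 + 4 + 5 + 6 = 18
```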

    Constrained mixture estimation for analysis and robust classification of clinical time series

    Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-β (IFNβ) treatment response in Multiple Sclerosis (MS) patients, which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard. For them, the treatment should be ceased in a timely manner to mitigate the side effects